Scalable communication event tracing via clustering

Authors

  • Amir Bahmani
  • Frank Mueller
Abstract

Communication traces help developers of high-performance computing (HPC) applications understand and improve their codes. When applications run on large-scale HPC facilities, the scalability of tracing tools becomes a challenge. To address this problem, traces can be clustered into groups of processes that exhibit similar behavior. Instead of collecting trace information for each individual node, it then suffices to collect a trace from a small set of representative nodes, namely one per cluster. However, clustering algorithms themselves need to have low overhead, be scalable, and adapt to application characteristics. We devised an adaptive clustering algorithm for large-scale applications called ACURDION that traces the MPI communication of code with O(log P) time complexity. First, ACURDION identifies the parameters that differ across processes by using a logarithmic algorithm called Adaptive Signature Building. Second, it clusters the processes based on those parameters. Experiments show that collecting traces of just nine nodes/clusters suffices to capture the communication behavior of all nodes for a wide set of HPC benchmark codes while retaining sufficient accuracy of trace events and parameters. In summary, ACURDION improves trace scalability and automation over prior approaches.
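
The two steps described in the abstract can be illustrated with a minimal Python sketch. It is not ACURDION's implementation: the function names (`merge`, `tree_reduce`, `cluster`), the per-process record layout, and the sample parameters `count` and `role` are all hypothetical, and the whole thing runs in one process rather than over MPI. The first step is simulated as a pairwise tree reduction, which needs about log2(P) rounds if P is read as the number of processes; the second step is a plain sequential grouping pass and makes no claim about the paper's complexity.

```python
from collections import defaultdict


def merge(a, b):
    """Combine two summaries of the form {param: (sample_value, differs)}."""
    out = {}
    for key in a.keys() | b.keys():
        if key in a and key in b:
            va, da = a[key]
            vb, db = b[key]
            # A parameter differs if either subtree already saw a
            # difference or the two sample values disagree.
            out[key] = (va, da or db or va != vb)
        else:
            # Assumption: a parameter missing on one side counts as differing.
            value, _ = a[key] if key in a else b[key]
            out[key] = (value, True)
    return out


def tree_reduce(summaries):
    """Merge summaries pairwise, level by level; return (summary, rounds)."""
    if not summaries:
        return {}, 0
    level = list(summaries)
    rounds = 0
    while len(level) > 1:
        nxt = [merge(level[i], level[i + 1])
               for i in range(0, len(level) - 1, 2)]
        if len(level) % 2:
            nxt.append(level[-1])  # odd element carries over unchanged
        level = nxt
        rounds += 1
    return level[0], rounds


def cluster(records):
    """Group ranks by the values of the parameters that differ across ranks.

    `records[rank]` is a dict of hashable parameter values (hypothetical layout).
    Returns (varying_params, clusters, representatives).
    """
    leaves = [{k: (v, False) for k, v in rec.items()} for rec in records]
    summary, _ = tree_reduce(leaves)
    varying = sorted(k for k, (_, differs) in summary.items() if differs)

    clusters = defaultdict(list)
    for rank, rec in enumerate(records):
        signature = tuple(rec.get(k) for k in varying)
        clusters[signature].append(rank)

    # One representative per cluster: here simply the lowest rank.
    representatives = sorted(ranks[0] for ranks in clusters.values())
    return varying, dict(clusters), representatives


if __name__ == "__main__":
    # Eight simulated processes: same message size everywhere, one root.
    demo = [{"count": 1024, "role": "root" if r == 0 else "worker"}
            for r in range(8)]
    varying, clusters, reps = cluster(demo)
    print(varying, clusters, reps)
```

In this sketch only the representatives would need to record a full trace; every other rank is described by its cluster's signature.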

Related resources

Integrated Runtime Measurement Summarisation and Selective Event Tracing for Scalable Parallel Execution Performance Diagnosis

Straightforward trace collection and processing becomes increasingly challenging and ultimately impractical for more complex, long-running, highly parallel applications. Accordingly, the KOJAK measurement system for MPI, OpenMP and SHMEM parallel applications is incorporating runtime management and summarisation capabilities. This offers a more scalable and effective profile of parallel executio...

Efficient clustering for ultra-scale application tracing

Extreme-scale computing poses a number of challenges to application performance. Developers need to study application behavior by collecting detailed information with the help of tracing toolsets to determine shortcomings. But not only applications are “scalability challenged”, current tracing toolsets also fall short of exascale requirements for low background overhead since trace collection f...

Scalable Induction of Probabilistic Real-Time Automata Using Maximum Frequent Pattern Based Clustering

The paper presents a scalable method for learning probabilistic real-time automata (PRTAs), a new type of model that captures the dynamics of multi-dimensional event logs. In multi-dimensional event logs, events are described by several features instead of only one symbol. Moreover, it is not clear up front which events occur in an event log. The learning method to find a PRTA that models such ...

Towards Scalable Event Tracing for High End Systems

Although event tracing of parallel applications offers highly detailed performance information, tracing on current leading edge systems may lead to unacceptable perturbation of the target program and unmanageably large trace files. High end systems of the near future promise even greater scalability challenges. Development of more scalable approaches requires a detailed understanding of the int...

Scalable Community Detection through Content and Link Analysis in Social Networks

Social network analysis is an important problem that has been attracting a great deal of attention in recent years. Such networks provide users many different applications and features; as a result, they have been mentioned as the most important event of recent decades. Using features that are available in the social networks, first discovering a complete and comprehensive communication should ...

Journal title:
  • J. Parallel Distrib. Comput.

Volume 109  Issue

Pages  -

Publication date 2017